!pip install plotly
Collecting plotly
Downloading plotly-5.5.0-py2.py3-none-any.whl (26.5 MB)
Collecting tenacity>=6.2.0
Downloading tenacity-8.0.1-py3-none-any.whl (24 kB)
Requirement already satisfied: six in /home/ab_rajiwate/anaconda3/lib/python3.8/site-packages (from plotly) (1.15.0)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.5.0 tenacity-8.0.1
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from random import randint
import joblib
import plotly.express as px
import plotly.offline as pyo
pyo.init_notebook_mode()
aisles_df = pd.read_csv('aisles.csv')
depts_df = pd.read_csv('departments.csv')
order_prod_df = pd.read_csv('order_products__prior.csv')
orders_df = pd.read_csv('cleaned_orders.csv')
orders_df.drop('Unnamed: 0',axis=1,inplace=True)
products_df = pd.read_csv('products.csv')
Before looking for trends, we will start with a basic exploration of the dataset. This will help us gain a better understanding of the data and might even surface something entirely new.
print(orders_df.shape)
orders_df.head()
(3214874, 7)
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | 0.0 |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
print(f'The number of users in our dataset is: {orders_df.user_id.nunique()}')
The number of users in our dataset is: 206209
We have data for exactly 206209 users in this dataset.
df = orders_df.groupby('user_id').count()
sns.histplot(data=df, x='order_id', kde=True)
plt.xlabel('Range of orders placed')  # set after plotting so seaborn does not override it
<AxesSubplot:xlabel='Range of orders placed', ylabel='Count'>
# Adjusting bins for better visualization
sns.histplot(data=df, x='order_id', bins=15)
plt.xlabel('Range of orders placed')
plt.axvline(df.order_id.values.mean(), linestyle='dashed', color='red')  # vertical line marking the mean
plt.show()
print(f'\nAverage of range of orders placed is : {round(df.order_id.values.mean(),3)}')
Average of range of orders placed is : 15.59
# Interactive Visualization
fig = px.histogram(df,x='order_id',labels={'order_id':'Range of orders placed'},title='Range of total no. of orders placed by users')
fig.show()
The distribution of order counts is heavily skewed to the right. The average is approximately 16 orders per user. We now know that this dataset is not dominated by a small number of users [which was our earlier speculation].
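The skewness claim can be quantified rather than read off the plot alone. A minimal sketch using pandas' built-in `skew()`, on a hypothetical sample of per-user order counts (in the notebook, this would be `df.order_id`):

```python
import pandas as pd

# Hypothetical per-user order counts; a few heavy users pull the tail right.
orders_per_user = pd.Series([3, 4, 4, 5, 6, 8, 12, 20, 45, 99])

# Positive skew => long right tail (most users place few orders,
# a minority place many).
skewness = orders_per_user.skew()
print(skewness > 0)  # True for a right-skewed distribution
```

A positive value confirms the right skew; a roughly symmetric distribution would give a value near zero.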
print(depts_df.shape)
depts_df.head()
(21, 2)
| | department_id | department |
|---|---|---|
| 0 | 1 | frozen |
| 1 | 2 | other |
| 2 | 3 | bakery |
| 3 | 4 | produce |
| 4 | 5 | alcohol |
print(aisles_df.shape)
aisles_df.head()
(134, 2)
| | aisle_id | aisle |
|---|---|---|
| 0 | 1 | prepared soups salads |
| 1 | 2 | specialty cheeses |
| 2 | 3 | energy granola bars |
| 3 | 4 | instant foods |
| 4 | 5 | marinades meat preparation |
print(products_df.shape)
products_df.head()
(49688, 4)
| | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 |
# Merging the Depts and Aisles tables into Products for easier analysis
temp = pd.merge(left=products_df, right=depts_df, how='outer' )
df = pd.merge(left=temp, right=aisles_df, how='outer')
df.head()
| | product_id | product_name | aisle_id | department_id | department | aisle |
|---|---|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | snacks | cookies cakes |
| 1 | 78 | Nutter Butter Cookie Bites Go-Pak | 61 | 19 | snacks | cookies cakes |
| 2 | 102 | Danish Butter Cookies | 61 | 19 | snacks | cookies cakes |
| 3 | 172 | Gluten Free All Natural Chocolate Chip Cookies | 61 | 19 | snacks | cookies cakes |
| 4 | 285 | Mini Nilla Wafers Munch Pack | 61 | 19 | snacks | cookies cakes |
Now that we have merged all the tables, let's check for any null values.
df.isnull().any()
product_id False product_name False aisle_id False department_id False department False aisle False dtype: bool
No null values, nice! Let's move on to the main task.
Aisle count for each department
for i in df.department.unique():
count = df[df.department == i].aisle_id.nunique()
print(f"{i} :: {count}")
print('-------------------------------------')
snacks :: 11
-------------------------------------
pantry :: 12
-------------------------------------
beverages :: 8
-------------------------------------
frozen :: 11
-------------------------------------
personal care :: 17
-------------------------------------
dairy eggs :: 10
-------------------------------------
household :: 10
-------------------------------------
babies :: 4
-------------------------------------
meat seafood :: 7
-------------------------------------
dry goods pasta :: 5
-------------------------------------
pets :: 2
-------------------------------------
breakfast :: 4
-------------------------------------
canned goods :: 5
-------------------------------------
produce :: 5
-------------------------------------
missing :: 1
-------------------------------------
international :: 4
-------------------------------------
deli :: 5
-------------------------------------
alcohol :: 5
-------------------------------------
bakery :: 5
-------------------------------------
other :: 1
-------------------------------------
bulk :: 2
-------------------------------------
Hmm... there is a department named "missing". Let's explore this in depth.
df[df.department == 'missing'].head()
| | product_id | product_name | aisle_id | department_id | department | aisle |
|---|---|---|---|---|---|---|
| 42813 | 38 | Ultra Antibacterial Dish Liquid | 100 | 21 | missing | missing |
| 42814 | 72 | Organic Honeycrisp Apples | 100 | 21 | missing | missing |
| 42815 | 110 | Uncured Turkey Bologna | 100 | 21 | missing | missing |
| 42816 | 297 | Write Bros Ball Point Pens, Cap-Pen, Medium (1... | 100 | 21 | missing | missing |
| 42817 | 417 | Classics Baby Binks Easter Chocolate Bunny | 100 | 21 | missing | missing |
As we can see, both the department and aisle names are missing for these products. Let's check how many departments and aisles have "missing" as their name.
Speculation: there will be only one aisle and one department named "missing". We can say this confidently because, during the data cleaning phase, we checked the aisles.csv and departments.csv datasets for duplicate values and found none. Hence aisle_id 100 and department_id 21 are likely the only aisle and department named "missing".
if df[df.department == 'missing'].aisle_id.nunique() == 1 and df[df.department == 'missing'].aisle.nunique() == 1:
print('The above speculation is True')
The above speculation is True
Even though their aisle and department names are missing, it does not mean these products are absent from the transactional dataset. We already know that a few products are missing from the main transactional dataset; we will check later how many products belong to aisle 100.
Product count for each aisle
count = []
for i in df.aisle.unique():
count.append(df[df.aisle == i].product_id.nunique())
temp = pd.DataFrame({'aisle':df.aisle.unique(),
'prod_freq':count})
temp.head()
| | aisle | prod_freq |
|---|---|---|
| 0 | cookies cakes | 874 |
| 1 | ice cream toppings | 85 |
| 2 | energy granola bars | 832 |
| 3 | chips pretzels | 989 |
| 4 | crackers | 747 |
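The explicit loop over `df.aisle.unique()` above can also be expressed as a single `groupby`, which avoids one filtering pass per aisle. A minimal sketch on a toy product table mirroring the merged `df` columns:

```python
import pandas as pd

# Toy product table with the two columns the loop actually uses.
df = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5],
    'aisle':      ['cookies cakes', 'cookies cakes', 'crackers',
                   'crackers', 'tea'],
})

# One groupby replaces the loop over df.aisle.unique().
prod_freq = (df.groupby('aisle')['product_id']
               .nunique()
               .reset_index(name='prod_freq'))
print(prod_freq)
```

On the full merged table this produces the same aisle/`prod_freq` pairs as the loop, in one vectorized pass.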
temp.sort_values(by='prod_freq',ascending=False).head(15)
| | aisle | prod_freq |
|---|---|---|
| 111 | missing | 1258 |
| 7 | candy chocolate | 1246 |
| 37 | ice cream ice | 1091 |
| 44 | vitamins supplements | 1038 |
| 59 | yogurt | 1026 |
| 3 | chips pretzels | 989 |
| 23 | tea | 894 |
| 61 | packaged cheese | 891 |
| 31 | frozen meals | 880 |
| 0 | cookies cakes | 874 |
| 2 | energy granola bars | 832 |
| 46 | hair care | 816 |
| 11 | spices seasonings | 797 |
| 24 | juice nectars | 792 |
| 4 | crackers | 747 |
The "missing" aisle, i.e. aisle 100, has the highest number of products, followed by candy chocolate. From the top 15 aisles we can see that this online grocery store covers a wide variety of aisles, each with a large assortment of products.
temp.sort_values(by='prod_freq',ascending=True).head(15)
| | aisle | prod_freq |
|---|---|---|
| 133 | bulk dried fruits vegetables | 12 |
| 132 | bulk grains rice dried goods | 26 |
| 110 | packaged produce | 32 |
| 81 | baby accessories | 44 |
| 41 | frozen juice | 47 |
| 89 | seafood counter | 54 |
| 10 | trail mix snack mix | 69 |
| 88 | packaged seafood | 80 |
| 35 | frozen breads doughs | 81 |
| 83 | poultry counter | 82 |
| 1 | ice cream toppings | 85 |
| 109 | fresh herbs | 86 |
| 121 | specialty wines champagnes | 95 |
| 67 | refrigerated pudding desserts | 98 |
| 84 | packaged poultry | 99 |
These are the 15 aisles with the fewest products. Apart from the first, "bulk dried fruits vegetables", the rest still have a fair number of products associated with them.
temp_prod = set(products_df.product_id.unique())
temp_order_prod = set(order_prod_df.product_id.unique())
missing_prod = temp_prod - temp_order_prod
print(f'Total missing products from order_prod_df are: {len(missing_prod)}')
print(f"Missing Products are displayed below:\n\t{missing_prod}")
Total missing products from order_prod_df are: 11
Missing Products are displayed below:
{46625, 49540, 7045, 3718, 25383, 37703, 36233, 27499, 43725, 3630, 45971}
temp = pd.DataFrame()
for i in missing_prod:  # iterate over the whole set; set order is not guaranteed
    temp = pd.concat([df[df.product_id == i], temp])
print(temp.shape)
temp
(11, 6)
| | product_id | product_name | aisle_id | department_id | department | aisle |
|---|---|---|---|---|---|---|
| 32630 | 45971 | 12 Inch Taper Candle White | 101 | 17 | household | air fresheners candles |
| 38861 | 3630 | Protein Granola Apple Crisp | 57 | 14 | breakfast | granola |
| 43916 | 43725 | Sweetart Jelly Beans | 100 | 21 | missing | missing |
| 43501 | 27499 | Non-Dairy Coconut Seven Layer Bar | 100 | 21 | missing | missing |
| 43706 | 36233 | Water With Electrolytes | 100 | 21 | missing | missing |
| 31337 | 37703 | Ultra Sun Blossom Liquid 90 loads Fabric Enhan... | 75 | 17 | household | laundry |
| 447 | 25383 | Chocolate Go Bites | 61 | 19 | snacks | cookies cakes |
| 27931 | 3718 | Wasabi Cheddar Spreadable Cheese | 21 | 16 | dairy eggs | packaged cheese |
| 8574 | 7045 | Unpeeled Apricot Halves in Heavy Syrup | 88 | 13 | pantry | spreads |
| 14338 | 49540 | Pure Squeezed Lemonade | 31 | 7 | beverages | refrigerated |
| 14299 | 46625 | Single Barrel Kentucky Straight Bourbon Whiskey | 31 | 7 | beverages | refrigerated |
for i in temp.department.unique():
print(f'Department: {i}')
print(f'{temp[temp.department == i].aisle.value_counts()}')
print('-'*100)
Department: household
laundry                   1
air fresheners candles    1
Name: aisle, dtype: int64
----------------------------------------------------------------------------------------------------
Department: breakfast
granola    1
Name: aisle, dtype: int64
----------------------------------------------------------------------------------------------------
Department: missing
missing    3
Name: aisle, dtype: int64
----------------------------------------------------------------------------------------------------
Department: snacks
cookies cakes    1
Name: aisle, dtype: int64
----------------------------------------------------------------------------------------------------
Department: dairy eggs
packaged cheese    1
Name: aisle, dtype: int64
----------------------------------------------------------------------------------------------------
Department: pantry
spreads    1
Name: aisle, dtype: int64
----------------------------------------------------------------------------------------------------
Department: beverages
refrigerated    2
Name: aisle, dtype: int64
----------------------------------------------------------------------------------------------------
Only 3 of these products belong to the department and aisle named "missing". In total, 11 products are absent from the main transactional dataset. This dataset alone is not sufficient to explain why; answering that would require further investigation with the stakeholders.
The following questions will uncover the information stored in this dataset. Stakeholders or marketing teams can use this information in ways that suit their needs.
We have already completed this computation previously and saved the results to a CSV file. All we need to do is add the product names and our result will be ready.
grouped_df = pd.read_csv('Order_Prod_Df_Freq.csv')
grouped_df.drop('Unnamed: 0',axis=1,inplace = True)
grouped_df = pd.merge(left=grouped_df,right=products_df,how='outer')
#grouped_df.drop(['aisle_id','department_id'],axis=1,inplace=True)
grouped_df.head()
| | product_id | frequency | product_name | aisle_id | department_id |
|---|---|---|---|---|---|
| 0 | 1 | 1852.0 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | 90.0 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | 277.0 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 3 | 4 | 329.0 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
| 4 | 5 | 15.0 | Green Chile Anytime Sauce | 5 | 13 |
Now we know that 11 products are missing from the transactional dataset, and since we performed an outer join, there must be some null values.
grouped_df.isnull().any()
product_id False frequency True product_name False aisle_id False department_id False dtype: bool
It seems only the frequency column has null values. This is because every product_id is present in the products dataset, but not every product appears in the transactional dataset, so those products have no frequency.
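The outer-join reasoning can be verified directly with `pd.merge`'s `indicator=True` flag, which tags each row by its source. A minimal sketch on toy frames (the real frames are `grouped_df` and `products_df`):

```python
import pandas as pd

# Toy frames: every product exists in the catalogue, but product 3
# never appears in the frequency table (mirrors the 11 missing ids).
products = pd.DataFrame({'product_id': [1, 2, 3]})
freqs = pd.DataFrame({'product_id': [1, 2], 'frequency': [10.0, 5.0]})

merged = pd.merge(freqs, products, how='outer', indicator=True)

# Rows tagged 'right_only' exist only in the product catalogue,
# so their frequency is NaN after the outer join.
missing = merged[merged['_merge'] == 'right_only']
print(missing['frequency'].isnull().all())  # True
```

On the real data, the `right_only` rows would be exactly the 11 products listed above.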
print(grouped_df[grouped_df.frequency.isnull()].shape)
grouped_df[grouped_df.frequency.isnull()]
(11, 5)
| | product_id | frequency | product_name | aisle_id | department_id |
|---|---|---|---|---|---|
| 49677 | 3630 | NaN | Protein Granola Apple Crisp | 57 | 14 |
| 49678 | 3718 | NaN | Wasabi Cheddar Spreadable Cheese | 21 | 16 |
| 49679 | 7045 | NaN | Unpeeled Apricot Halves in Heavy Syrup | 88 | 13 |
| 49680 | 25383 | NaN | Chocolate Go Bites | 61 | 19 |
| 49681 | 27499 | NaN | Non-Dairy Coconut Seven Layer Bar | 100 | 21 |
| 49682 | 36233 | NaN | Water With Electrolytes | 100 | 21 |
| 49683 | 37703 | NaN | Ultra Sun Blossom Liquid 90 loads Fabric Enhan... | 75 | 17 |
| 49684 | 43725 | NaN | Sweetart Jelly Beans | 100 | 21 |
| 49685 | 45971 | NaN | 12 Inch Taper Candle White | 101 | 17 |
| 49686 | 46625 | NaN | Single Barrel Kentucky Straight Bourbon Whiskey | 31 | 7 |
| 49687 | 49540 | NaN | Pure Squeezed Lemonade | 31 | 7 |
The above output shows exactly the 11 products that were not present in the main transactional dataset, hence their frequency is missing. We will fill the NaN values with zero.
grouped_df.fillna(0,inplace=True)
#Checking whether previous operation was successful or not
grouped_df.isnull().any()
product_id False frequency False product_name False aisle_id False department_id False dtype: bool
# Rearranging columns
grouped_df = grouped_df.reindex(columns=['product_id','product_name','aisle_id','department_id','frequency'])
# Saving grouped_df
grouped_df.to_csv('Product_Selling_Ranking.csv')
Answering this question will show us the most active days of the week.
orders_df.head()
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | 0.0 |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
sns.countplot(x='order_dow',data=orders_df).set(title='Activity Ranking for days in a week',xlabel='Days of week')
plt.show()
Given the above graph, we can infer that 0 and 1 are the weekend: 0 is Saturday and 1 is Sunday. Both have the highest, and roughly equal, activity levels, while days 2 through 6 have fairly similar, lower activity.
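The visual reading can be backed with exact shares via `value_counts(normalize=True)`. A minimal sketch on a hypothetical day-of-week column (in the notebook, `orders_df.order_dow`):

```python
import pandas as pd

# Hypothetical order_dow values (0-6); days 0 and 1 dominate here.
order_dow = pd.Series([0, 0, 0, 1, 1, 1, 1, 2, 3, 4, 5, 6])

# Share of orders per day, busiest first; makes the countplot reading exact.
shares = order_dow.value_counts(normalize=True).sort_values(ascending=False)
print(shares.head(2))
```

On the real column, the first two index values of `shares` should be 0 and 1, matching the countplot.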
# One hourly countplot per day of the week (0-6)
for dow in range(7):
    temp_df = orders_df[orders_df.order_dow == dow]
    sns.countplot(data=temp_df, x='order_hour_of_day').set(
        title=f'Activity Ranking for Day {dow}', xlabel='Hours in a Day')
    plt.show()
order_prod_df.head()
| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |
order_prod_df.product_id.nunique()
49677
products_df.shape
(49688, 4)
prod_id = order_prod_df[order_prod_df.reordered == 1].product_id.value_counts().index
reord_freq = order_prod_df[order_prod_df.reordered == 1].product_id.value_counts().values
temp_df = pd.DataFrame({'product_id':prod_id,
'reorder_count':reord_freq})
print(temp_df.shape)
temp_df.head(10)
(45305, 2)
| | product_id | reorder_count |
|---|---|---|
| 0 | 24852 | 398609 |
| 1 | 13176 | 315913 |
| 2 | 21137 | 205845 |
| 3 | 21903 | 186884 |
| 4 | 47209 | 170131 |
| 5 | 47766 | 134044 |
| 6 | 27845 | 114510 |
| 7 | 47626 | 106255 |
| 8 | 27966 | 105409 |
| 9 | 16797 | 99802 |
There seem to be a lot of products missing. This may be because these products were filtered out, as they were never reordered. Let's confirm this speculation.
if temp_df.shape[0] == order_prod_df[order_prod_df.reordered == 1].product_id.nunique():
print('The speculation is correct')
The speculation is correct
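As an aside, the two `value_counts()` calls used to build `temp_df` above can be collapsed into one chain, since `value_counts` already pairs each id with its count. A minimal sketch on toy transactional rows mirroring `order_prod_df`:

```python
import pandas as pd

# Toy transactional rows: product 10 reordered twice, 20 once, 30 never.
order_prod = pd.DataFrame({
    'product_id': [10, 10, 10, 20, 20, 30],
    'reordered':  [1, 1, 0, 1, 0, 0],
})

# rename_axis names the index; reset_index turns it into two columns.
temp_df = (order_prod[order_prod.reordered == 1]
           .product_id.value_counts()
           .rename_axis('product_id')
           .reset_index(name='reorder_count'))
print(temp_df)
```

This yields the same product_id/reorder_count frame without computing `value_counts()` twice, and never-reordered products (30 here) drop out just as observed above.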
Let's get the product names for the corresponding product ids and save the dataframe as a CSV file.
temp_df = pd.merge(left=temp_df,right=products_df,how='inner')
temp_df = temp_df.reindex(columns=['product_id','product_name','aisle_id','department_id','reorder_count'])
temp_df.sort_values(by='product_id',inplace=True)
print(temp_df.shape)
temp_df.head(10)
(45305, 5)
| | product_id | product_name | aisle_id | department_id | reorder_count |
|---|---|---|---|---|---|
| 2679 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 1136 |
| 29871 | 2 | All-Seasons Salt | 104 | 13 | 12 |
| 9084 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 | 203 |
| 10856 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 | 147 |
| 31880 | 5 | Green Chile Anytime Sauce | 5 | 13 | 9 |
| 38207 | 6 | Dry Nose Oil | 11 | 11 | 3 |
| 29790 | 7 | Pure Coconut Water With Orange | 98 | 7 | 12 |
| 14539 | 8 | Cut Russet Potatoes Steam N' Mash | 116 | 1 | 83 |
| 14660 | 9 | Light Strawberry Blueberry Yogurt | 120 | 16 | 82 |
| 2373 | 10 | Sparkling Orange Juice & Prickly Pear Beverage | 115 | 7 | 1304 |
temp_df.to_csv('Product_Reorder_Ranking.csv')
Answering this question will help us identify products that are reordered within a short period of time. Identifying them will help the retailer stock up on these products and avoid losses.
px.histogram(orders_df,x='days_since_prior_order',labels={'days_since_prior_order':'Number of Days'},title='Days Since Prior Order')
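To complement the histogram, simple summary statistics make the typical reorder gap explicit. A minimal sketch on hypothetical gap values (in the notebook, `orders_df.days_since_prior_order`):

```python
import pandas as pd

# Hypothetical days-since-prior-order values, with a cluster at 30 days
# (gaps are often capped at 30 in this dataset).
gaps = pd.Series([3, 7, 7, 7, 14, 21, 30, 30, 30, 30])

# Median and mode summarise the "reorder cadence" the histogram shows.
print(f"median gap: {gaps.median()} days")
print(f"most common gap: {gaps.mode().iloc[0]} days")
```

On the real column, spikes at 7 and 30 days would indicate weekly and monthly shopping cycles.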